An incremental cluster-based approach to spam filtering

Authors

  • Wen-Feng Hsiao
  • Te-Ming Chang
Abstract

As email has become a popular means of communication over the Internet, the problem of receiving unsolicited and undesired emails, called spam or junk mail, has grown severe. To filter spam from legitimate emails, automatic classification approaches using text mining techniques have been proposed. Such approaches, however, often suffer from a low recall rate because of two characteristics of spam: skewed class distributions and concept drift. This research therefore proposes a classification approach that alleviates the problems of skewed class distributions and drifting concepts. A cluster-based classification method, called ICBC, is developed accordingly. ICBC contains two phases. In the first phase, it clusters the emails in each given class into several groups, and an equal number of features (keywords) is extracted from each group so that the features of the minority class are represented. In the second phase, ICBC is equipped with an incremental learning mechanism that lets it adapt to changes in the environment quickly and at low cost. Three experiments are conducted to evaluate the performance of ICBC. The results show that ICBC can effectively deal with skewed and changing class distributions, and that its incremental learning can also reduce the cost of re-training. The feasibility of the proposed approach is thus justified. © 2007 Elsevier Ltd. All rights reserved.
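The two phases described in the abstract can be illustrated with a small sketch. This is not the authors' ICBC implementation: the abstract does not say which clustering algorithm, similarity measure, or feature score ICBC uses, so the leader-style clustering, the token-overlap threshold, the raw term-frequency ranking, and all function names and sample emails below are assumptions made only for illustration.

```python
from collections import Counter


def overlap(tokens, cluster):
    """Fraction of the email's distinct tokens already present in the cluster."""
    distinct = set(tokens)
    if not distinct:
        return 0.0
    return sum(t in cluster for t in distinct) / len(distinct)


def add_email(clusters, text, threshold=0.5):
    """Incremental step (assumed): fold one email into its best-matching
    cluster, or open a new cluster when nothing matches well enough."""
    tokens = text.lower().split()
    best = max(clusters, key=lambda c: overlap(tokens, c), default=None)
    if best is None or overlap(tokens, best) < threshold:
        best = Counter()
        clusters.append(best)
    best.update(tokens)  # only this cluster's term counts change
    return best


def cluster_class(texts, threshold=0.5):
    """Phase 1 (assumed): group the emails of one class into clusters."""
    clusters = []
    for text in texts:
        add_email(clusters, text, threshold)
    return clusters


def select_features(clusters_by_class, per_cluster=2):
    """Take the same number of top terms from every cluster of every class,
    so small groups are not drowned out by large ones."""
    features = set()
    for clusters in clusters_by_class.values():
        for cluster in clusters:
            features.update(term for term, _ in cluster.most_common(per_cluster))
    return features


# Hypothetical toy data, not taken from the paper.
emails = {
    "spam": [
        "cheap pills cheap pills online",
        "cheap pills discount online",
        "win lottery prize now",
        "lottery prize cash",
    ],
    "legit": [
        "meeting agenda monday",
        "meeting agenda notes",
    ],
}

clusters = {label: cluster_class(texts) for label, texts in emails.items()}
features = select_features(clusters)

# Concept drift: a new kind of spam opens a new cluster without re-clustering
# the emails that were already processed.
add_email(clusters["spam"], "crypto investment opportunity")
updated_features = select_features(clusters)
```

In this sketch every cluster contributes the same number of keywords regardless of its size, which mirrors the abstract's idea of extracting an equal number of features per group. A classifier would then be trained on the selected keywords; that step is omitted here because the abstract does not describe it.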


Similar resources

An Incremental Learning Based Framework for Image Spam Filtering

Image spam remains an unsolved problem for two reasons. One is the diversity of spamming tricks. The other is the evolving nature of image spam. As new spam constantly emerges, filters’ effectiveness drops over time. In this paper, we present an effective anti-spam approach to solve the two problems. First, a novel clustering filter is proposed. By exploring...


Incremental Immune-Inspired Clustering Approach to Behavior-Based Anti-Spam Technology

Facing the new challenge of maintaining clusters in a dynamic web environment with a high volume of updates and costly re-clustering, the paper describes a novel behavior-based anti-spam technology based on an incremental immune-inspired clustering algorithm. We use an “internal image” network to represent the input data set in order to reduce data redundancy, whilst at the same time extractin...


An Anti-spam Filter Combination Framework for Text-and-Image Emails through Incremental Learning

We present an anti-spam filtering framework that combines text-based and image-based anti-spam filters. First, an incremental learning approach to reducing mismatches between training and test datasets is proposed to resolve the problem of a lack of training data for legitimate emails that contain both text and images. Then, the outputs of text-based and image-based filters are combined with th...


Feature Weight Optimization Mechanism for Email Spam Detection based on Two-Step Clustering Algorithm and Logistic Regression Method

This research proposes an improved spam filtering technique for suspected emails and messages, based on feature weighting and a combination of two-step clustering and logistic regression. Unique, important features are used as the optimum input for the proposed hybrid approach. This study adopted a spam detector model based on a distance measure and a threshold value. The aim of this model was t...


A Novel Method of Text Clustering for Chinese Spam Based on Semantic Body

Statistics-based spam filtering methods perform poorly on new types of spam that use synonym substitution and camouflage. A new text clustering method based on Semantic Body is therefore proposed for filtering Chinese spam. In this paper, word sense disambiguation, lexical chains based on HowNet, and statistics-based TFIDF are adopted to extract the features of mails. The Semantic B...



Journal:
  • Expert Syst. Appl.

Volume 34  Issue 

Pages  -

Publication date 2008